Abstract. The Thesaurus for the Social Sciences (TheSoz) is a Linked Dataset in SKOS format, which serves as a crucial
instrument for information retrieval based on e.g. document indexing or search term recommendation. Thesauri and similar
controlled vocabularies build a linking bridge between datasets from the Linked Open Data cloud. In this article the conversion
process of the TheSoz to SKOS is described, including the analysis of the original dataset and its structure, the mapping to
adequate SKOS classes and properties, and the technical conversion. In order to create a semantically complete representation
of TheSoz in SKOS, extensions based on SKOS-XL had to be defined. These allow the modeling of special relations like compound
equivalences and terms with ambiguities. Additionally, mappings to other datasets and the application of the TheSoz are
presented. Finally, limitations and modeling issues encountered during the creation process are discussed.

1. Introduction

The Thesaurus for the Social Sciences (TheSoz) is a SKOS-based
German thesaurus for the domain of the social sciences. It
serves as a crucial instrument for indexing documents and
research information as well as for search term recommendation.
The TheSoz is available in three languages (German, English and
French) and contains about 12,000 keywords in total, of which
8,000 are so-called descriptors, i.e. preferred terms for
indexing documents, and 4,000 are non-descriptors, i.e.
non-preferred terms, for which preferred terms are recommended
to be used instead. The thesaurus covers all topics and
sub-disciplines of the social sciences such as sociology,
employment research, pedagogics or political science.
Additionally, terminology from associated and related
disciplines like economics is included in order to support an
accurate and adequate indexing process of interdisciplinary,
praxis-oriented and multicultural documents. The thesaurus is
owned and maintained by GESIS - Leibniz Institute for the
Social Sciences, an infrastructure organization in Germany
which provides research-based infrastructure services for the
social sciences. Although TheSoz is specific to the social
sciences, most of the terms are also used in colloquial German,
especially in classic media and political texts.

First attempts [12] to represent the TheSoz in SKOS (Simple
Knowledge Organization System) format were made in 2009, after
SKOS had been announced as a standard by the W3C. Since then,
many organizations and libraries have begun bringing their
thesauri and vocabularies to the web in SKOS format.

The SKOS version of the TheSoz is based on an intellectually
maintained database. It is currently available in version 0.92
via a SPARQL endpoint as well as for download in RDF/XML and
RDF/Turtle under a Creative Commons Licence. Currently, it
consists of 421,083 triples. Each URI is HTTP dereferenceable,
which has been enabled by a representation via the Pubby Linked
Data Frontend. This HTML representation is accessible at
http://lod.gesis.org/thesoz/. A user-friendly RDFa-supported
interface is planned for the future.

TheSoz also provides crosslinks to other thesauri: the STW, the
equivalent German thesaurus for economics, and the AGROVOC
thesaurus for the agricultural domain, as well as to DBpedia.
Although the subject matters of the connected thesauri are
similar in some cases, e.g. between TheSoz and STW, the terms
and concepts used are constructed quite differently. Links
between thesauri build a relevant bridge for the connection
between different Linked Data sources.

In section 2 of this article the conversion process of the
TheSoz to SKOS is presented, including the definition of
extensions based on SKOS-XL. Furthermore, the use of
established vocabularies and links to other thesauri are
described. In section 3 the application of the TheSoz for
information retrieval purposes is presented. Typical knowledge
modeling patterns occurring in the use of SKOS and SKOS-XL are
discussed in section 4. Section 5 concludes and presents an
outlook on future work regarding the TheSoz.

2. Conversion of the TheSoz to SKOS 

The SKOS version of the TheSoz is created from the original
source, which is maintained and stored in a customized data
management system at GESIS. Content updates for the thesaurus
are released every month, followed directly by an update of the
SKOS version. The transformation of the thesaurus data into
SKOS format has been split into three steps, following the
structured method introduced in [1]: (1) analysis of the
structure, the extent and the complexity of the thesaurus,
including contained terms and relations between terms, (2)
mapping of all detected terms and relations to adequate SKOS
classes and properties and (3) the technical conversion of the
thesaurus according to the defined mapping. This method [1]
aims to ensure the quality and utility of the resulting
conversion regarding its interoperability and completeness. All
three steps of the method were applied once to the thesaurus in
order to generate an initial SKOS version. The third step, the
technical conversion and creation of the target output, is
repeated regularly, whenever the content of the thesaurus is
updated.
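
Steps (2) and (3) of this method can be illustrated with a
minimal Python sketch. It is not the conversion code used for
the TheSoz. The relation codes ("BT", "NT", "RT"), the record
layout and the "concept/" URI path are assumptions made only for
illustration; the base URI and the SKOS namespace are the
published ones.

```python
# Hypothetical sketch of steps (2) and (3). Relation codes, record layout
# and the "concept/" path segment are illustrative assumptions, not the
# actual TheSoz export format.

SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://lod.gesis.org/thesoz/"

# Step 2: mapping of thesaurus relation types to SKOS properties.
RELATION_MAPPING = {
    "BT": SKOS + "broader",
    "NT": SKOS + "narrower",
    "RT": SKOS + "related",
}


def convert(records):
    """Step 3: turn (source_id, relation, target_id) records into triples.

    Relations without an entry in RELATION_MAPPING are returned separately,
    so that they can be handled by extensions instead of being dropped.
    """
    triples, unmapped = [], []
    for source_id, relation, target_id in records:
        prop = RELATION_MAPPING.get(relation)
        if prop is None:
            unmapped.append((source_id, relation, target_id))
            continue
        subject = BASE + "concept/" + source_id
        obj = BASE + "concept/" + target_id
        triples.append((subject, prop, obj))
    return triples, unmapped


records = [
    ("c1", "BT", "c2"),
    ("c2", "NT", "c1"),
    ("t9", "USE_COMBINATION", "c1"),  # no plain SKOS counterpart
]
triples, unmapped = convert(records)
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
print("needs extension:", unmapped)
```

Keeping the mapping in a separate table means that the monthly
re-run of step (3) does not have to touch the decisions made in
step (2).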

2.1. Thesaurus Creation 

The basis for transforming a thesaurus to SKOS format is a
detailed analysis of the thesaurus. Therefore, attention has
been paid not only to terms and the existing associative and
hierarchical relations between them, but also to the general
structure and design of the thesaurus, e.g. the existence of an
additional classification system or how far the thesaurus
conforms to established ISO norms. Relationships between the
8,000 TheSoz descriptors are expressed as broader, narrower or
related terms. There are also "use instead" and "use
combination" relations and their counterparts ("used for" and
"used for combination") between descriptors and
non-descriptors. Additionally, a classification hierarchy is
provided and each thesaurus term is assigned to one or more
classification notations.

For most of the thesaurus terms and relations, adequate SKOS
properties and classes can easily be identified because of the
broad compatibility of the TheSoz with the standard norms for
thesauri. Problems have been observed when mapping special data
items and relations which are not compliant with thesaurus
standards, like the AD (alternative non-descriptor) terms of
the TheSoz. An alternative non-descriptor is used for
indicating ambiguities. Terms of this type are generic and
ambiguous terms which have different meanings in specialized
sub-contexts. This is expressed through the use of multiple
"use instead" and/or "use combination" relations for one single
term at the same time. There are about 200 such AD terms in the
TheSoz.

Table 1

Overview of custom extensions defined for the TheSoz.

| Extension | Description |
|---|---|
| thesoz:Descriptor | Descriptors of the TheSoz, defined as a subclass of "skos:Concept". |
| thesoz:Classification | Notation of the classification hierarchy of the TheSoz, defined as a subclass of "skos:Concept". |
| thesoz:EquivalenceRelationship | An equivalence relationship between two terms, where the terms are assigned via the "thesoz:use" and "thesoz:usedFor" properties. This is a subclass of "skosxl:Label". |
| thesoz:CompoundEquivalence | A compound equivalence between terms, for constructing "use combination" and "used for combination" relations between terms. The non-preferred term is assigned by the "thesoz:compoundNonPreferredTerm" property, the preferred terms by the "thesoz:preferredTermComponent" property. This is a subclass of "skosxl:Label". |
| thesoz:use | Use relation, defined as a subproperty of "skosxl:labelRelation". |
| thesoz:usedFor | Used for relation, defined as a subproperty of "skosxl:labelRelation". |
| thesoz:preferredTermComponent | A preferred term as a component of a "use combination" and "used for combination" relation. This property is defined as a subproperty of "skosxl:labelRelation". |
| thesoz:compoundNonPreferredTerm | The non-preferred term as a component of a "use combination" and "used for combination" relation. This property is defined as a subproperty of "skosxl:labelRelation". |
| thesoz:isPartOfEquivalenceRelationship | Relation from a term to the class "thesoz:EquivalenceRelationship". |
| thesoz:isPartOfCompoundEquivalence | Relation from a term to the class "thesoz:CompoundEquivalence". |
| thesoz:hasTranslation | Relation between different languages of a term, defined as a subproperty of "skosxl:labelRelation". |
| thesoz:isTranslationOf | Inverse property of "thesoz:hasTranslation". |

For example, the term "committee", which is classified as an AD
term, holds "use" relations to the preferred terms "working
group", "parliamentary committee", "Wirtschaftsausschuss" and
"advisory panel" at the same time. Additionally, it holds a
"use combination" relation to the combined use of the terms
"product" and "quality". In this case, this means that the term
"committee" is so general and ambiguous in its semantics that
it is recommended to use a more precise term to describe the
intended meaning.

However, because SKOS is based on RDF, it is possible to define
custom relations without great effort. Accordingly, a precise
mapping to SKOS has been more complex than initially thought
[12], where only simple relations of the TheSoz were modeled in
SKOS. In order to obey the concept-based structure of SKOS
without losing relevant relations between preferred and
non-preferred terms, classes and properties of SKOS-XL (SKOS
extension for Labels) have been used. SKOS-XL has also been
used for the representation of the EUROVOC thesaurus [10].
Properties of SKOS-XL have been developed explicitly for the
representation of lexical issues and provide the possibility to
model relations between lexical terms inside one SKOS concept.
Because of the interconnection of terms in the TheSoz,
descriptors are represented as "skos:Concept", but each of the
terms is additionally modeled separately as "skosxl:Label". The
property "skosxl:labelRelation" allows the definition of
extensions such as typical equivalence relationships like "use"
or compound equivalence relationships like "use combination".
Table 1 provides an overview of the custom classes and
properties defined for the TheSoz.

Thus, the term "university ranking" is assigned via the
property "thesoz:compoundNonPreferredTerm" by an instance of
the class "thesoz:CompoundEquivalence". The two preferred terms
("university" and "ranking"), which should be used instead, are
assigned by the property "thesoz:preferredTermComponent" from
the same instance. This construct indicates that both preferred
terms have to be used together instead of the non-preferred
term (see Figure 1).
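
The "university ranking" construct can be written down as a
handful of triples. The following Python sketch only illustrates
the shape of the data. The extension namespace is the one given
for the TheSoz, but the node and term URIs ("compound/1",
"term/...") are invented placeholders, not identifiers from the
published dataset.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
THESOZ = "http://lod.gesis.org/thesoz/ext/"
BASE = "http://lod.gesis.org/thesoz/"


def compound_equivalence(node, non_preferred, components):
    """Return the triples for one thesoz:CompoundEquivalence instance."""
    triples = [
        (node, RDF_TYPE, THESOZ + "CompoundEquivalence"),
        (node, THESOZ + "compoundNonPreferredTerm", non_preferred),
    ]
    # Every preferred term that has to be used in combination is attached
    # to the same instance.
    for component in components:
        triples.append((node, THESOZ + "preferredTermComponent", component))
    return triples


# Placeholder URIs, chosen only for this example.
triples = compound_equivalence(
    BASE + "compound/1",
    BASE + "term/university-ranking",
    [BASE + "term/university", BASE + "term/ranking"],
)
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```

Because all components hang off one intermediate instance, the
information that they belong together is kept, which two
independent "use" relations could not express.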

This modeling approach allows a consistent representation of
the alternative non-descriptors of the TheSoz. Thus, the AD
term "committee" and its relationships to other terms can be
modeled as depicted in Figure 2. It is assigned by multiple
"thesoz:usedFor" properties from different instances of the
class "thesoz:EquivalenceRelationship", each of which also
holds a "thesoz:use" property to one of the preferred terms
"working group", "parliamentary committee",
"Wirtschaftsausschuss" and "advisory panel". Additionally, it
is assigned by the "thesoz:compoundNonPreferredTerm" property
from an instance of the class "thesoz:CompoundEquivalence",
which models the "use combination" relation for the combined
use of the terms "product" and "quality". Without SKOS-XL based
extensions, the ambiguities of the term would be lost.

Fig. 1. Representation of a "use combination" relation in the TheSoz.

Fig. 2. Representation of the AD term "committee" in the TheSoz.

Fig. 3. Links from the term "quality assurance" to other Datasets.
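
From the point of view of an application, the modeled
relationships answer the question which preferred terms to use
instead of an ambiguous one. A minimal Python sketch of such a
lookup is given below. The class `EquivalenceIndex` and its
methods are hypothetical helpers introduced for illustration
and are not part of the TheSoz; the term labels are taken from
the "committee" example.

```python
from collections import defaultdict


class EquivalenceIndex:
    """In-memory view of "use" and "use combination" relations."""

    def __init__(self):
        # non-preferred term -> list of alternatives; each alternative is
        # a tuple of preferred terms that have to be used together.
        self._alternatives = defaultdict(list)

    def add_use(self, non_preferred, preferred):
        self._alternatives[non_preferred].append((preferred,))

    def add_use_combination(self, non_preferred, components):
        self._alternatives[non_preferred].append(tuple(components))

    def alternatives(self, term):
        return list(self._alternatives.get(term, []))

    def is_ambiguous(self, term):
        # More than one alternative corresponds to an AD term.
        return len(self._alternatives.get(term, [])) > 1


index = EquivalenceIndex()
for preferred in ("working group", "parliamentary committee",
                  "Wirtschaftsausschuss", "advisory panel"):
    index.add_use("committee", preferred)
index.add_use_combination("committee", ["product", "quality"])
index.add_use_combination("university ranking", ["university", "ranking"])

print(index.is_ambiguous("committee"))
print(index.alternatives("committee"))
```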

According to the Linked Data principles, all concepts, terms
and classification notations get their own URI, which provides
a persistent and unique identification of each element. This is
a very important aspect for re-using and linking TheSoz terms
on the web. All URIs are listed in the context path
http://lod.gesis.org/thesoz/, which serves as base URI. The URI
has been chosen according to the naming conventions for web
addresses of GESIS and in order to leave room for the
publication of further datasets as Linked Data. The namespace
of the custom classes and properties is defined at
http://lod.gesis.org/thesoz/ext/ and is abbreviated by the
prefix "thesoz".

2.2. Usage of existing Vocabularies 

Besides the SKOS standard, additional established metadata
vocabularies are used to represent the TheSoz. For citation,
licensing and provenance purposes, properties of Dublin Core,
OWL, Creative Commons and the Provenance Vocabulary have been
used. Detailed information on links to other datasets is
represented using the VoID vocabulary. This dataset description
is available online (http://lod.gesis.org/thesoz/void.ttl). The
SKOS-XL extensions of TheSoz are defined in detail using RDF
Schema. This enables further processing and interoperability
with other datasets on the web.

2.3. Links to other Datasets 

Currently, the SKOS version of the TheSoz holds links to the
STW Thesaurus for Economics of the ZBW and the AGROVOC
thesaurus of FAO as well as to DBpedia. Such links are
important for the use of the thesaurus as a terminology hub
during information retrieval, e.g. for search and query
expansion. The SKOS mapping properties provide standard
relations in order to represent links of SKOS concepts between
different concept schemes, i.e. between different datasets.
Table 2 provides an overview of the number and kind of links
from TheSoz to other datasets.

Table 2

Number of Links from TheSoz to other Datasets.

| Dataset | Number and Type of Links |
|---|---|
| STW | 4927 (2844 exact matches, 631 related matches, 1418 broad matches, 34 narrow matches) |
| AGROVOC | 846 (840 exact matches, 6 close matches) |
| DBpedia | 5024 (all exact matches) |

These links have been established differently. While 
the links to STW have been made manually during 
a major mapping initiative E) and have been con- 
verted to SKOS via XSL transformation afterwards 
0, the links to AGROVOC and DBpedia have been 
detected by semi-automatic approaches, whose results
were afterwards evaluated by domain experts.
The links to AGROVOC have been identified by a
distance measure approach using the Levenshtein
distance with a threshold of 0.21 [6]. The links to
DBpedia have been detected using a standard string
similarity algorithm. Figure 3 depicts links from the term
"quality assurance" to STW, AGROVOC and DBpedia.

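A label comparison of the kind described for AGROVOC can be
sketched in Python as follows. This is an illustration under
stated assumptions, not the procedure of [6]: the text does not
say how the distance is normalized or how the threshold is
applied, so dividing by the length of the longer label and
accepting pairs at or below 0.21 are assumptions. The target
labels are made up.

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    # Assumption: normalize by the length of the longer string.
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def candidate_links(source_labels, target_labels, threshold=0.21):
    """Return (source, target, distance) for all pairs under the threshold.

    The result is a list of candidates for review by domain experts,
    not a final set of links.
    """
    candidates = []
    for source in source_labels:
        for target in target_labels:
            distance = normalized_distance(source.lower(), target.lower())
            if distance <= threshold:
                candidates.append((source, target, distance))
    return candidates


candidates = candidate_links(
    ["quality assurance"],
    ["Quality assurance", "food security"],  # made-up target labels
)
print(candidates)
```
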


3. Usage of the TheSoz 

TheSoz is the main tool for indexing research literature in the
German-speaking social sciences and is applied, among other
disciplinary information systems, in the databases SOLIS
(Social Science Literature Information System), SOFIS (Social
Science Research Information System) and SSOAR (Social Science
Open Access Repository), all owned and maintained by GESIS. The
thesaurus is used for document retrieval in the information
portal sowiport, which is visited by about 11,000 unique users
per month. This portal uses TheSoz as a terminology hub for
search and query expansion, which enables the retrieval of more
than 7,000,000 research documents on a semantically very
precise level. Other databases of sowiport (e.g. CSA
Sociological Abstracts, PAIS International) are indexed with
thesauri other than TheSoz. To overcome this heterogeneity,
links between terms of these thesauri are used to support a
meaningful keyword search. The links are used automatically to
expand the search terms for retrieval inside the
non-TheSoz-indexed databases. In [3] the effect of using these
mappings for intra- and interdisciplinary search questions has
been evaluated in a controlled scenario. The expansion with
"exact match" mappings shows a very positive effect in terms of
retrieval precision and recall.
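
The expansion step can be sketched in a few lines of Python.
The function `expand_query`, the relation names and the mapping
targets are illustrative assumptions and do not reproduce the
sowiport implementation or actual TheSoz mappings.

```python
def expand_query(terms, mappings, relations=("exactMatch",)):
    """Add mapped terms of other vocabularies to the search terms.

    mappings: dict from a TheSoz term to a list of (relation, target term).
    Only mappings whose relation is listed in `relations` are followed.
    """
    expanded = []
    for term in terms:
        if term not in expanded:
            expanded.append(term)
        for relation, target in mappings.get(term, []):
            if relation in relations and target not in expanded:
                expanded.append(target)
    return expanded


# Invented mapping targets, for illustration only.
mappings = {
    "quality assurance": [
        ("exactMatch", "quality management"),
        ("relatedMatch", "product quality"),
    ],
}
print(expand_query(["quality assurance"], mappings))
```

Restricting the expansion to exact matches by default follows
the observation reported in [3]; other mapping types can be
enabled through the `relations` parameter.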

TheSoz and its links to other thesauri are also used for the
construction and provision of specialized search term
recommendation functionalities, which are outlined and
evaluated in [2]. In this evaluation, the best recommender was
a system that starts by recommending TheSoz descriptors and
then leads to further highly associated TheSoz descriptors,
which are listed on the basis of a pre-computed co-word
analysis.
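
The idea of a pre-computed co-word analysis can be illustrated
with a small Python sketch. It uses raw co-occurrence counts of
descriptors assigned to the same document as the association
measure, which is a simplifying assumption; the measure used in
[2] may differ. The indexed documents are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(indexed_documents):
    """Count how often two descriptors are assigned to the same document."""
    cooccurrence = defaultdict(Counter)
    for descriptors in indexed_documents:
        for a, b in combinations(sorted(set(descriptors)), 2):
            cooccurrence[a][b] += 1
            cooccurrence[b][a] += 1
    return cooccurrence


def recommend(descriptor, cooccurrence, limit=3):
    """Return the descriptors most often co-assigned with `descriptor`."""
    return [term for term, _ in cooccurrence[descriptor].most_common(limit)]


# Invented descriptor assignments, one list per document.
documents = [
    ["university", "ranking", "evaluation"],
    ["university", "ranking"],
    ["university", "student"],
    ["university", "ranking", "student"],
]
cooccurrence = build_cooccurrence(documents)
print(recommend("university", cooccurrence, limit=2))
```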



4. Modeling Issues 

During the modeling process, several obstacles with the use of
SKOS have been observed. Even if a thesaurus meets established
ISO norms for thesauri, a conversion to SKOS is not always as
trivial as expected [1][7][10]: relations of the original
thesaurus cannot always be modeled adequately in SKOS. For
TheSoz, especially the representation of compound relationships
and compound concepts has been an obstacle. The same
requirement arose in [1][10], because compound concepts are
part of the ISO 2788 standard. Critical modeling issues are
discussed in an overview of correspondences between the ISO
norms 2788/5964 and SKOS inside the SKOS Primer. For
syntactical compositions of terms like compound equivalences,
it is suggested to define custom extensions either of
"skos:Concept" or of "skosxl:Label". For the SKOS version of
TheSoz, subclasses of both and subproperties of
"skosxl:labelRelation" have been defined. [10] applied these
relations in a similar way for the EUROVOC thesaurus. [7] has
defined a custom construct called "zbwext:useInsteadNote" as a
subproperty of "skos:note", which holds information about a
"use instead" relation.

When modeling mappings between thesauri in SKOS format,
inconsistencies and problems can occur which are caused by
idiosyncrasies of the thesauri. One reason for inconsistencies
can be that given mappings were defined on term-based thesauri
before their conversion to SKOS. Such links cannot always be
modeled directly with the SKOS mapping properties. It has to be
investigated whether the two terms of a given mapping represent
adequate concepts in the corresponding SKOS versions, e.g. by
being used as "skos:prefLabel" of a concept. Mappings between
non-preferred terms cannot be modeled directly in SKOS, because
the mapping properties of SKOS can only be applied between
concepts. Although ISO 5964 allows relations between
non-preferred terms, in SKOS this is only possible by defining
and using SKOS-XL extensions.
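
The check described above can be sketched as a small Python
function. The function name, the returned labels and the example
URIs are hypothetical and serve only to show the decision: a
term-based mapping can be expressed with a SKOS mapping property
only if both terms are preferred labels of concepts.

```python
def classify_mapping(source_term, target_term,
                     source_pref_labels, target_pref_labels):
    """Decide how a term-based mapping can be expressed after conversion.

    source_pref_labels and target_pref_labels map a preferred label to the
    URI of its concept in the respective SKOS version.
    """
    source_concept = source_pref_labels.get(source_term)
    target_concept = target_pref_labels.get(target_term)
    if source_concept and target_concept:
        # Both terms identify concepts: a SKOS mapping property applies.
        return ("skos-mapping", source_concept, target_concept)
    # At least one term is not a preferred label of any concept.
    return ("needs-skosxl-extension", source_term, target_term)


# Placeholder URIs and labels, invented for this example.
thesoz_labels = {"quality assurance": "http://example.org/thesoz/c1"}
stw_labels = {"quality assurance": "http://example.org/stw/c9"}

print(classify_mapping("quality assurance", "quality assurance",
                       thesoz_labels, stw_labels))
print(classify_mapping("committee", "quality assurance",
                       thesoz_labels, stw_labels))
```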

Domain-specific differences between thesauri can cause
conversion problems as well. A concept in one thesaurus might
correspond to a combination of two concepts in another
thesaurus, e.g. the term "Electronic Government" of the TheSoz
was originally mapped to the combination of the terms "Public
Administration" and "Internet" of the STW. The mapping
properties of SKOS do not allow such single-to-multiple
relations (neither for one language nor for multiple
languages).

Transforming existing vocabularies and thesauri to SKOS remains
a complex process owing to the heterogeneous structure of the
involved vocabularies. If semantically more complex relations
are required, e.g. "use combination" relations between terms of
the same or different vocabularies, custom extensions still
have to be defined in order to preserve the relevant
information. However, such extensions can lead to
incompatibilities with other SKOS datasets or with applications
for processing data in SKOS format, e.g. SKOS thesaurus
management tools. Therefore, extensions should be described
using standard classes and properties, e.g. with RDF Schema, so
that there is a chance that such data can be processed at least
in a minimal way.



5. Conclusion and Future Work 

In this paper, the Linked Dataset TheSoz, a widely used
thesaurus in the domain of the social sciences, has been
presented. It has been brought to SKOS format by the use of
SKOS-XL extensions. This has been necessary because of specific
and complex relations of the original thesaurus like compound
equivalences.

It is planned to add more links to datasets like EUROVOC, the
Integrated Authority File (GND) dataset of the German National
Library and more. For some of these tasks, the use of link
discovery tools like Silk [11] or Amalgame [8] is being
considered.

TheSoz and other linked vocabularies play a vital role for
connecting heterogeneous GESIS datasets like the literature
collections of sowiport or the GESIS Data Catalogue, which
comprises study descriptions of survey data, with each other
and with other Linked Datasets from the web. By using Linked
Data we aim to provide an integrated view on documents and
datasets relevant for social science research.

TheSoz participates as a test dataset in the Library Track at
the upcoming OAEI 2012, together with the STW thesaurus. The
existing links between both thesauri serve as reference
alignments.